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ABSTRACT 

Two  algorithms  for  reordering  sparse,  symmetric  matrices  or  undirected  graphs  to 
reduce  envelope  and  wavefront  are  considered.  The  first  is  a  combinatorial  algorithm 
introduced  by  Sloan  and  further  developed  by  Duff,  Reid,  and  Scott;  we  describe  en¬ 
hancements  to  the  Sloan  algorithm  that  improve  its  quality  and  reduce  its  run  time. 
Our  test  problems  fall  into  two  classes  with  differing  asymptotic  behavior  of  their  en¬ 
velope  parameters  as  a  function  of  the  weights  in  the  Sloan  algorithm.  We  describe 
an  efficient  (9(nlog  n  +  m)  time  implementation  of  the  Sloan  algorithm,  where  n  is  the 
number  of  rows  (vertices),  and  m  is  the  number  of  nonzeros  (edges).  On  a  collection 
of  test  problems,  the  improved  Sloan  algorithm  required,  on  the  average,  only  twice 
the  time  required  by  the  simpler  Reverse  Cuthill-McKee  algorithm  while  improving  the 
mean  square  wavefront  by  a  factor  of  three.  The  second  algorithm  is  a  hybrid  that 
combines  a  spectral  algorithm  for  envelope  and  wavefront  reduction  with  a  refinement 
step  that  uses  a  modified  Sloan  algorithm.  The  hybrid  algorithm  reduces  the  envelope 
size  and  mean  square  wavefront  obtained  from  the  Sloan  algorithm  at  the  cost  of  greater 
running  times.  We  illustrate  how  these  reductions  translate  into  tangible  benefits  for 
frontal  Cholesky  factorization  and  incomplete  factorization  preconditioning. 
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1  Introduction 


We  consider  two  algorithms  for  reducing  the  envelope  and  wavefront  of  sparse,  symmet¬ 
ric  matrices  or  undirected  graphs.  The  first  algorithm  was  introduced  by  Sloan  [39], 
improved  further  by  Duff,  Reid,  and  Scott  [11],  and  is  currently  the  best  combinatorial 
algorithm  for  this  problem.  We  describe  enhancements  to  Sloan’s  algorithm  that  (i) 
reduce  the  envelope  and  wavefront  size  further,  and  (ii)  reduce  its  asymptotic  time  com¬ 
plexity  and  practical  execution  times.  The  second  algorithm  is  a  new  hybrid  algorithm 
that  combines  an  algebraic  (spectral)  algorithm  for  envelope  reduction  described  by 
Barnard,  Pothen  and  Simon  [4]  with  the  Sloan  algorithm  as  a  post-processing  step.  The 
spectral  algorithm  takes  a  “global”  viewpoint  of  the  problem,  but  could  potentially  be 
improved  by  combining  it  with  a  “local”  refinement  algorithm.  The  spectral  algorithm 
is  known  to  produce  envelope  and  wavefront  sizes  significantly  smaller  than  previous 
algorithms  [4].  The  hybrid  algorithm  further  reduces  the  envelope  size  and  wavefronts 
over  the  spectral  and  Sloan  algorithms.  We  present  a  few  examples  to  show  that  these 
improved  orderings  could  lead  to  faster  frontal  solves  and  more  efficient  incomplete  fac¬ 
torization  preconditioners. 

Sloan  [39]  described  an  implementation  of  his  algorithm  for  unweighted  graphs.  The 
idea  of  Sloan’s  algorithm  is  to  number  vertices  from  one  endpoint  of  an  approximate 
diameter  in  the  graph,  choosing  the  next  vertex  to  number  from  among  the  neighbors 
of  currently  numbered  vertices  and  their  neighbors.  A  vertex  of  maximum  priority  is 
chosen  from  this  eligible  subset  of  vertices;  the  priority  of  a  vertex  has  a  “local”  term 
that  attempts  to  reduce  the  incremental  increase  in  the  wavefront,  and  a  “global”  term 
that  reflects  its  distance  from  the  second  endpoint  of  the  approximate  diameter. 

Duff,  Reid,  and  Scott  [11]  have  extended  this  algorithm  to  weighted  graphs  obtained 
from  finite  element  meshes,  and  have  used  these  orderings  for  frontal  factorization  meth¬ 
ods.  The  weighted  implementation  is  faster  for  finite  element  meshes  when  several  ver¬ 
tices  have  common  adjacency  relationships.  They  have  also  described  variants  of  the 
Sloan  algorithm  that  work  directly  with  the  elements  (rather  than  the  nodes  of  the 
elements).  The  Sloan  algorithm  is  a  remarkable  advance  over  previously  available  algo¬ 
rithms  such  as  Reverse  Cuthill-McKee  (RCM)  [6],  Gibbs-Poole-Stockmeyer  [18,  29],  and 
Gibbs-King  [17]  algorithms  since  it  computes  smaller  envelope  and  wavefront  sizes. 

For  the  most  part,  we  follow  Sloan,  and  Duff,  Reid  and  Scott  in  our  work  on  the 
Sloan  algorithm.  Our  new  contributions  are  the  following: 

•  We  show  that  the  use  of  a  heap  instead  of  an  array  to  maintain  the  priorities  of 
vertices  leads  to  a  lower  time  complexity,  and  an  implementation  that  is  about 
four  times  faster  on  our  test  problems.  Sloan  had  implemented  both  versions, 
preferring  the  array  over  the  heap  for  the  smaller  problems  he  worked  with,  and 
had  reported  results  only  for  the  former.  Duff,  Reid,  and  Scott  had  followed  Sloan 
in  this  choice. 

•  Our  implementation  of  the  Sloan  algorithm  for  vertex-weighted  graphs  mimics 
what  the  algorithm  would  do  on  the  corresponding  unweighted  graph,  unlike  the 
Duff,  Reid,  and  Scott  implementation.  Hence  we  define  the  key  parameters  in  the 
algorithm  differently,  and  this  results  in  smaller  wavefront  sizes. 

•  We  examine  the  weights  of  the  two  terms  in  the  priority  function  to  show  that 
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our  test  problems  fall  into  two  classes  with  different  asymptotic  behaviors  of  their 
envelope  parameters;  by  choosing  different  weights  for  these  two  classes,  we  reduce 
the  wavefront  sizes  obtained  from  the  Sloan  algorithm,  on  the  average,  to  60%  of 
the  original  Sloan  algorithm  on  a  set  of  eighteen  test  problems. 

Together,  these  enhancements  enable  the  Sloan  algorithm  to  compute  small  envelope 
and  wavefront  sizes  fast — the  time  it  needs  is  in  general  between  two  to  five  times  that 
of  the  simpler  RCM  algorithm. 

This  paper  is  the  third  in  a  series  on  spectral  algorithms  for  envelope  and  wavefront 
reduction.  We  will  now  summarize  the  findings  in  the  first  two  papers  to  put  our  work 
on  the  hybrid  algorithm  in  context. 

Barnard,  Pothen,  and  Simon  [4]  described  a  spectral  algorithm  that  associates  a 
Laplacian  matrix  with  the  given  symmetric  matrix,  computes  an  eigenvector  correspond¬ 
ing  to  the  smallest  positive  Laplacian  eigenvalue,  and  then  computes  the  permutation 
by  sorting  the  components  of  the  eigenvector  in  monotonically  increasing  or  decreasing 
order. 

Unlike  the  rest  of  the  algorithms  that  are  combinatorial  in  nature,  the  spectral  al¬ 
gorithm  is  algebraic,  and  hence  its  good  envelope-reduction  properties  are  intriguing. 
George  and  Pothen  [16]  analyzed  the  algorithm  theoretically,  by  considering  a  related 
problem  called  the  2-sum  problem.  They  showed  that  minimizing  the  2-sum  over  all 
permutations  is  equivalent  to  a  quadratic  assignment  problem,  in  which  the  trace  of  a 
product  of  matrices  is  minimized  over  the  set  of  permutation  matrices.  This  problem 
is  NP-complete;  however,  lower  bounds  for  the  2-sum  could  be  obtained  by  minimizing 
over  the  set  of  orthogonal  and  doubly  stochastic  matrices.  (Permutation  matrices  satisfy 
the  additional  property  that  their  elements  are  nonnegative;  this  property  is  relaxed  to 
obtain  a  lower  bound.)  This  technique  gave  tight  lower  bounds  for  the  2-sum  for  many 
finite-element  problems,  showing  that  the  2-sums  from  the  spectral  ordering  were  nearly 
optimal  (within  a  few  percent  typically).  They  also  showed  that  the  permutation  matrix 
closest  to  the  orthogonal  matrix  attaining  the  lower  bound  is  obtained  (to  first  order) 
by  permuting  the  second  Laplacian  eigenvector  in  monotonic  order.  This  justifies  the 
spectral  algorithm  for  minimizing  the  2-sum.  These  authors  also  showed  that  a  family  of 
graphs  with  small  (n'^)  separators  has  small  mean  square  wavefront  (at  most  (9(n^+'’')), 
where  n  is  the  number  of  vertices  in  the  graph,  and  the  exponent  7  >  1/2  determines 
the  separator  size. 

The  analysis  of  the  spectral  algorithm  suggests  that  while  spectral  orderings  may 
also  reduce  related  quantities  such  as  the  envelope  size  and  the  work  in  an  envelope 
factorization,  they  might  be  improved  further  by  post-processing  with  a  combinatorial 
reordering  algorithm.  We  explore  this  issue  further  by  using  the  second  step  of  the 
Sloan  algorithm  in  the  post-processing  step;  the  resulting  algorithm  is  called  the  hybrid 
algorithm  in  the  rest  of  this  paper. 

We  list  some  work  on  related  problems.  Juvan  and  Mohar  [27,  28]  have  considered 
spectral  methods  for  minimizing  the  p-sum  problem  (for  p  >  1),  and  Paulino  et  al.  [35, 
36]  have  applied  spectral  orderings  to  minimize  envelope  sizes.  Additionally,  spectral 
methods  have  been  applied  successfully  in  areas  such  as  graph  partitioning  [26,  37,  38], 
the  seriation  problem  [3],  and  DNA  sequencing  [20]. 

The  rest  of  this  paper  is  organized  as  follows.  In  Section  2,  we  review  background 
information.  First  we  define  various  envelope  parameters,  delve  into  the  details  of  the 
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spectral  algorithm,  and  then  describe  a  problem  where  the  spectral  algorithm  performs 
poorly  but  where  the  hybrid  algorithm  does  well.  Section  3  describes  the  details  of  a 
weighted  Sloan  algorithm;  we  show  how  the  envelope  parameters  vary  as  a  function  of 
the  weights  in  the  priority  function.  We  analyze  the  time  complexity  of  our  efficient 
implementation  (in  the  Appendix),  and  show  that  it  runs  about  four  times  faster,  on 
the  average,  than  previous  implementations.  In  Section  4,  we  then  describe  the  hybrid 
algorithm,  which  refines  the  spectral  ordering  by  means  of  the  second  step  of  a  modified 
Sloan  algorithm.  In  Section  5,  we  present  results  from  the  RCM,  Sloan,  spectral,  and  hy¬ 
brid  ordering  algorithms  for  a  collection  of  problems.  Comparisons  are  made  across  four 
envelope  parameters  (envelope  size,  bandwidth,  maximum  wavefront,  and  mean-square 
wavefront),  and  running  time.  Section  6  presents  some  preliminary  results  from  using 
the  hybrid  ordering  in  frontal  Cholesky  and  incomplete  factorization  preconditioning. 
Conclusions  and  directions  for  future  work  are  included  in  Section  7. 


2  Background 

We  provide  definitions  of  various  envelope  parameters  in  Section  2.1,  and  review  the 
spectral  algorithm  for  envelope  and  wavefront  reduction  in  Section  2.2.  Then  in  Sec¬ 
tion  2.3,  we  motivate  the  hybrid  algorithm  by  describing  a  class  of  problems  where  a 
poor  spectral  ordering  is  improved  by  the  Sloan  post-processing  step  in  the  hybrid. 


2.1  Definitions  and  Notation 

Consider  a  sparse  symmetric  n  x  n  matrix  A  =  [uq],  whose  diagonal  elements  are  all 
nonzero.  We  consider  only  the  lower  triangle  of  A  (including  the  diagonal).  Let  fi{A) 
denote  the  column  index  of  the  first  nonzero  element  of  the  ith  row.  The  row  width  of 
the  ith  row,  rWj(A),  is  the  difference  between  i  and  /j(A),  or  equivalently, 

rWe(A)=  max  {f-j}. 

The  envelope  of  a  matrix  is  defined  as 

Env(A)  =  {(f,i)  :  ffiA)  <  j  <iA  <i  <n}  . 

The  envelope  of  a  symmetric  matrix  is  easily  visualized:  picture  the  lower  triangle  of 
the  matrix,  and  remove  the  diagonal  and  the  leading  zero  elements  in  each  row.  The 
remaining  elements  (whether  nonzero  or  zero)  are  in  the  envelope  of  the  matrix.  The 
number  of  these  elements  is  the  envelope  size^  Esize(A)  =  |Env(A)|,  which  can  also  be 
expressed  as 

n 

Esize(^) 

i-1 

Sloan  [39]  uses  the  term  profile  which  denotes  the  envelope  size  plus  the  number  of 
elements  on  the  diagonal. 

Another  envelope  parameter  is  the  bandwidth  of  a  matrix,  defined  as 


bw(A) 


max  {rwj-  (A)| . 

l<t<n 
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Consider  the  ith  step  of  Cholesky  factorization  where  only  the  lower  triangle  of  A  is 
stored.  An  equation  (row)  k  is  active  at  the  ith  step  if  A;  >  i  and  there  exists  a  column 
I  <  i  such  that  7^  0.  The  ith  wavefront  of  A,  wfi(A),  is  the  set  of  active  equations 
during  the  ith  step  of  Cholesky  factorization.  We  can  describe  the  ith  wavefront  in  three 
ways  that  are  more  intuitive.  It  is  the  set  of  rows  that  have  nonzeros  in  the  submatrix 
consisting  of  the  first  i  columns  of  A  and  rows  i  to  n.  It  is  also  the  set  of  rows  in  the  fth 
column  that  are  within  the  envelope  of  the  matrix,  where  the  ith  row  is  also  included. 
We  can  also  define  the  zth  wavefront  in  terms  of  the  adjacency  graph  of  A.  If  X  is  a  set 
of  vertices  in  a  graph,  then  its  adjacency  set 

adj(X)  =  (  U  adj(u)')  \X 

In  the  adjacency  graph  of  A,  the  ith  wavefront  consists  of  the  vertex  i  together  with  the 
set  of  vertices  adjacent  to  the  vertices  numbered  from  1  to  i.  Formally,  the  ith  wavefront 
is 

wfi(A)  =  Vi(J  adj  {{vi,V2, . . . ,  uj). 

The  n  wavefront  sizes  (one  for  each  column)  can  be  characterized  by  the  values 
maximum  wavefront  and  mean-square  wavefront 

maxwf(Ai)  =  max  {|wfi(A)|}, 

l<«<n 

mswf(A)  =  —  ^  |wfe(A)p  . 

”  i=i 

The  maximum  wavefront  size  measures  the  maximum  storage  needed  for  a  frontal  matrix 
during  a  frontal  factorization,  while  the  mean  square  wavefront  measures  the  number  of 
floating  point  operations  in  the  factorization.  Duff,  Reid,  and  Erisman  [9]  discuss  the 
application  of  wavefront  reducing  orderings  to  frontal  factorization.  It  is  easy  to  verify 
the  identity 

n  n 

^  |wfi(A)|  -  n  +  ^rwi(A)  =  n  +  Esize- 

2=:1  i=l 

The  envelope  and  wavefront  parameters  depend  on  the  order  in  which  vertices  of 
the  graph  are  numbered  and  are  independent  of  the  numerical  values  of  the  actual 
matrix  elements.  This  process  of  vertex  numbering  permutes  the  corresponding  matrix 
symmetrically  by  rows  and  columns.  Formally,  we  construct  a  permutation  matrix  P 
for  a  given  ordering  and  symmetrically  permute  a  matrix  A  such  that 


A'  =  PAP'^. 


The  goal  is  to  find  a  permutation  matrix  or  an  ordering  of  the  vertices  of  adjacency  graph 
to  minimize  the  envelope  size  or  the  mean-square- wavefront.  Minimizing  the  envelope 
size  and  the  bandwidth  of  a  matrix  are  NP-complete  problems  [31];  and  related  problems 
such  as  minimizing  the  2-sum  are  also  NP-complete  [16]. 

Figure  1(a)  shows  a  small  two-dimensional  grid  and  Figure  1(b)  shows  the  structure 
of  its  associated  matrix  A.  Figure  1(c)  is  a  table  showing  the  row-widths  and  wavefronts 
of  the  matrix  A.  From  this  table,  we  can  compute  the  parameters  Esize(A)  =  46, 
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Figure  1:  A  two  dimensional  mesh  and  its  vertex  ordering  are  shown  in  (a),  the  structure 
of  the  associated  matrix  is  in  (b),  and  a  table  of  pertinent  data  is  in  (c). 


bw(A)  =  4,  maxwf(A)  =  5,  and  mswf(A)  Rs  16.4.  If  we  numbered  the  vertices  in  Figure  1 
in  a  spiral  fashion  beginning  with  vertex  one  and  numbering  from  the  outside  towards 
the  inside,  the  permuted  matrix  A'  yields  Esize(A')  =  59,  bw(A')  =  11,  maxwf(A')  =  7, 
and  mswf(A')  ps  24.8. 

The  unstructured  grid  bcsstkSO  is  the  stiffness  matrix  of  an  off-shore  generator 
platform  from  the  Harwell-Boeing  test  collection  [10].  We  show  the  nonzero  patterns 
from  the  RCM,  Sloan,  spectral,  and  hybrid  orderings  in  Figure  2. 


2.2  Spectral  Ordering  Algorithm 

Spectral  methods  associate  a  Laplacian  matrix  with  the  given  symmetric  matrix  A, 


Laplacian(A)  =  [fj]  = 


'  -1  if  z  /  j,  Oij  ^  0 

<  0  if  z  7^  j,  Oij  =  0  . 

'Ai=j 


The  Laplacian  matrix  of  an  undirected  graph  is  defined  as  the  Laplacian  matrix  asso¬ 
ciated  with  its  adjacency  matrix.  The  Laplacian  matrix  is  a  singular  M-matrix.  By 
construction,  the  Laplacian  has  row  and  column  sums  identically  zero.  Its  smallest 
eigenvalue  is  zero,  and  the  corresponding  eigenvector  is  the  vector  of  all  ones.  If  the 
given  matrix  is  irreducible,  or  equivalently,  if  its  adjacency  graph  is  connected,  zero  is  a 
simple  eigenvalue.  An  eigenvector  corresponding  to  the  smallest  positive  eigenvalue  of 
the  Laplacian  matrix  is  called  a  Fiedler  vector  in  recognition  of  the  pioneering  work  of 
Miroslav  Fiedler  on  the  spectral  properties  of  the  Laplacian  [12,  13]. 

The  spectral  ordering  is  obtained  by  sorting  the  components  of  the  Fiedler  vector  in 
monotonically  nonincreasing  or  nondecreasing  order.  The  same  permutation  is  applied 
to  the  original  matrix  to  obtain  the  spectral  ordering.  George  and  Pothen  [16]  show 
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Figure  3:  The  hybrid  ordering  of  the  roach  grid  and  its  associated  matrix. 


that  reversing  the  ordering  will  change  (improve  or  deteriorate)  the  envelope  size  by  a 
multiplicative  factor  that  is  at  most  the  maximum  degree  of  a  vertex  in  the  graph. 

We  do  not  need  to  compute  the  Fiedler  vector  very  accurately  for  these  applications. 
Since  a  multilevel  algorithm  is  used  to  compute  the  Fiedler  vector  for  the  large  problems 
that  we  consider,  the  practical  implementations  of  our  algorithms  sometimes  work  with 
misconverged  Fiedler  vectors.  Our  experience  is  that  these  misconverged  vectors  work 
quite  well  in  this  application.  Greater  reductions  in  the  envelope  parameters  result  when 
a  local  refinement  algorithm,  such  as  the  Sloan  algorithm,  is  used,  than  by  computing 
the  Fiedler  vector  more  accurately.  Similar  observations  have  been  made  when  multilevel 
algorithms  are  used  in  graph  partitioning  [24]. 

We  find  that  on  many  finite  element  problems  spectral  orderings  do  well  in  a  global 
sense,  but  often  do  poorly  on  a  local  scale.  It  is  exactly  this  amenability  to  local 
refinement  that  we  seek  to  exploit  with  our  hybrid  algorithm. 

2.3  Counter-Examples  for  Spectral  Envelope  Reduction 

The  spectral  algorithm  computes  the  lowest  wavefront  and  envelope  sizes  over  current 
algorithms  for  many  finite  element  meshes  as  the  results  in  Section  5  will  show.  However, 
there  are  problems  on  which  the  spectral  method  can  perform  poorly,  as  can  be  seen  in 
the  results  presented  in  Subsection  5.2.  Here  we  consider  an  example  due  to  Guattery 
and  Miller  [22]  where  a  spectral  partitioning  algorithm  fails  to  find  a  good  cut  if  the  part 
sizes  must  be  balanced,  turns  out  to  be  one  on  which  the  spectral  ordering  algorithm 
does  badly  as  well.  We  show  that  the  hybrid  algorithm,  in  which  the  spectral  ordering 
is  refined  by  the  Sloan  algorithm  in  a  post-processing  step,  does  well  on  this  problem. 

Figure  3  shows  an  example  of  the  “roach”  graph  and  the  ordering  computed  by  the 
hybrid  algorithm.  The  roach  graph  is  a  ladder  with  the  top  2/3  of  the  rungs  removed. 
For  a  given  positive  integer  k,  this  graph  has  6A;  vertices:  2k  along  each  “antenna”,  and 
2k  vertices  on  the  ladder.  The  spectral  ordering  of  this  graph  would  begin  numbering 
from  the  endpoint  of  one  of  the  antennae,  march  along  the  outline  of  the  graph,  and  end 
at  the  endpoint  of  the  other  antenna.  This  leads  to  an  envelope  size  of  2A;^,  and  a  mean 
square  wave  front  of  k^JlS.  (Only  leading  terms  are  shown.)  It  can  be  seen  in  Figure  3 
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that  the  hybrid  algorithm  numbers  nodes  along  one  antenna,  then  alternates  across  the 
rungs  of  the  ladder,  and  finally  numbers  the  second  antenna.  This  leads  to  an  envelope 
size  of  lOA;,  and  a  mean  square  wavefront  of  {2/3)k,  an  order  of  magnitude  decrease  in 
both. 

For  the  benefit  of  the  reader  familiar  with  graphs  constructed  from  the  crossproduct 
of  a  path  and  double  tree,  described  in  [22],  we  mention  that  the  proposed  hybrid 
algorithm  exhibits  similar  behavior. 

3  A  Fast  Implementation  of  the  Sloan  Algorithm 

We  describe  a  variant  of  the  Sloan  algorithm  applicable  to  vertex-weighted  graphs  in 
Section  3.1;  we  also  discuss  the  behavior  of  the  envelope  parameters  as  a  function  of  the 
weights  in  the  Sloan  algorithm.  In  Section  3.2,  we  describe  an  efficient  implementation 
of  this  algorithm.  The  Appendix  contains  a  complexity  analysis  to  demonstrate  that  the 
new  implementation  takes  O(nlogn)  time  for  problems  with  good  separators,  whereas 
earlier  implementations  require  at  least  time. 

3.1  The  Weighted  Sloan  Algorithm 

In  this  section  we  consider  a  weighted  graph  on  a  set  of  multi-vertices  and  edges,  with 
integer  weights  on  the  multi-vertices.  We  think  of  the  weighted  graph  as  being  derived 
from  an  unweighted  graph,  and  the  weight  of  a  multi- vertex  as  the  number  of  vertices 
of  the  unweighted  graph  that  it  represents.  The  weighted  graphs  in  our  applications  are 
obtained  from  finite  element  meshes,  where  neighboring  vertices  with  the  same  adjacency 
structures  are  “condensed”  together  to  form  multi-vertices.  The  weighted  graph  could 
potentially  have  fewer  vertices  and  many  fewer  edges  than  the  original  unweighted  graph 
in  many  finite  element  problems.  Duff,  Reid,  and  Scott  [11]  call  the  weighted  graph  the 
supervariable  connectivity  graph.  Ashcraft  [2]  refers  to  it  as  the  compressed  graph,  and 
has  used  it  to  speed  up  the  minimum-degree  algorithm,  and  Wang  [40]  used  it  for  an 
efficient  nested  dissection  algorithm. 

A  few  graph-theoretic  concepts  are  needed  to  describe  Sloan’s  algorithm.  The  dis¬ 
tance  between  two  vertices  in  a  graph  is  the  number  of  edges  in  a  shortest  path  joining 
them.  The  diameter  is  a  path  in  the  graph  whose  length  is  the  largest  distance  between 
any  two  vertices.  A  pseudo- diameter  is  an  approximation  to  a  diameter. 

Sloan’s  algorithm  [39]  is  a  graph  traversal  algorithm  that  has  two  parts.  The  first 
part  is  heuristic  algorithm  that  selects  a  start  vertex  s  and  an  end  vertex  e  that  form  the 
endpoints  of  a  pseudo-diameter.  The  second  part  then  numbers  the  vertices,  beginning 
from  s,  and  chooses  the  next  vertex  to  number  from  a  set  of  eligible  vertices  by  means  of  a 
priority  function.  Roughly,  the  priority  of  a  vertex  has  a  dynamic  and  static  component: 
the  dynamic  component  favors  a  vertex  that  increases  the  current  wavefront  the  least, 
while  the  static  part  favors  vertices  at  the  greatest  distance  from  the  end  vertex  e.  The 
computation-intensive  part  of  the  algorithm  is  maintaining  the  priorities  of  the  eligible 
vertices  correctly  as  vertices  are  numbered. 

We  follow  Duff,  Reid  and  Scott  in  their  efficient  scheme  to  compute  the  pseudo¬ 
diameter  in  the  first  step  of  the  Sloan  algorithm. 
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function  Sloan 
begin 

{  Initialize:  given  a  vertex- weighted  graph  G,  weights  Wi  and  PV25 
start  vertex  s,  end  vertex  e,  and  adjacency  lists  of  vertices} 

0.  norm  =  [dist(s,  e)/AJ; 

1.  for  z  —  1  to  n 

2.  status[f]  <—  inactive 

3.  P[i]  =  —Wi  *  norm  *  incr(i)  -j-  W2  ^  dist(i,  e)  • 

4.  endfor 

5.  status [5]  preactive 

{  Main  Loop  } 
for  Ar  =  1  to  rz 

i  =  vertex  of  maximum  priority  (P[*])  among  all  active  or  preactive  vertices 
order  [z]  <—  k 
forall  j  G  adj(i)  do 

case  (status[i]  =  preactive  and  status[j]  =  inactive  or  preactive): 

P[j]  ^  P[j]  4- (size (f)  -h  size(j))  *  norm  *  14^1  {j  now  active,  i  numbered} 
status[j]  <—  active 
far_neighbors(j) 

break 

case  (status[i]  =  preactive  and  status[j]  =  active): 

P[j]  ^  P[j]  +size(i)  *  norm  *  Wi  {i  moves  from  preactive  to  numbered} 

break 

case  (status [i]  —  active  and  status[j]  =  preactive): 

P[j]  P[j]  -|-size(j)  *  norm  *  Wi  {j  moves  from  preactive  to  active} 
status  [j]  active 
far  .neighbors  (j) 
break 

end  forall 

status [/]  f-  numbered 

end  for 

end 

function  far_neighl)ors(j) 

begin 

26.  forall  i  G  adj(j)(r  /)  do 

27.  if  (status[r]  =  inactive)  then  status[^]  <—  preactive  end  if 

28.  P[i]  ^  P[b]  -hsize(j)  *  norm*  Wi  {j  now  active} 

29.  end  forall 
end 


Figure  4:  The  Sloan  algorithm  for  a  vertex-weighted  graph. 
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Eligible  Vertices.  Vertices  are  in  four  mutually  exclusive  states  at  each  step  of  the 
algorithm.  Any  vertex  that  has  already  been  numbered  in  the  algorithm  is  a  numbered 
vertex.  Active  vertices  are  unnumbered  vertices  that  are  adjacent  to  some  numbered 
vertex.  Vertices  that  are  adjacent  to  active  vertices  but  are  neither  active  nor  numbered 
are  called  preactive  vertices.  All  other  vertices  are  inactive.  Initially  all  vertices  are 
inactive,  except  for  s,  which  is  preactive. 

At  any  step  k,  the  sum  of  the  sizes  of  the  active  vertices  is  exactly  the  size  of  the 
wavefront  at  that  step  for  the  reordered  matrix,  wffc(/lAP^),  where  P  is  the  current 
permutation.  Active  and  preactive  vertices  comprise  the  set  of  vertices  eligible  to  be 
numbered  in  future  steps. 

An  eligible  vertex  with  the  maximum  priority  is  chosen  to  be  numbered  next.  The 
priority  function  of  a  vertex  i  has  two  components:  incr(?),  the  increase  in  the  wavefront 
size  (the  number  of  additional  vertices  that  enter  the  wavefront)  if  i  were  to  be  numbered 
next,  and  dist(i,e),  its  distance  from  the  end  vertex  e. 

Increase  in  Wavefront  Size.  Our  implementation  of  the  weighted  Sloan  algorithm 
on  the  weighted  graph  mimics  what  the  Sloan  algorithm  would  do  on  an  unweighted 
graph,  and  thus  we  define  the  degrees  of  the  vertices  and  incr(?)  differently  from  Duff, 
Reid,  and  Scott  [11]. 

We  denote  by  size(i)  the  integer  weight  of  a  multi-vertex  i.  The  degree  of  the  multi¬ 
vertex  i,  deg(i),  is  the  sum  of  the  sizes  of  its  neighboring  multi- vertices.  Let  the  current 
degree  of  a  vertex  i,  cdeg{i),  denote  the  sum  of  the  sizes  of  the  neighbors  of  i  among 
preactive  or  inactive  vertices.  It  can  be  computed  by  subtracting  from  the  degree  of 
i  the  sum  of  the  sizes  of  its  neighbors  that  are  numbered  or  active.  When  an  eligible 
vertex  is  assigned  the  next  available  number,  its  preactive  or  inactive  neighbors  move 
into  the  wavefront.  Thus 

_  J  cdeg(i) -b  size(i),  if  i  is  preactive 
incr(z)  —  I  cdeg(z),  if  i  is  active 

The  size(i)  term  for  a  preactive  vertex  i  accounts  for  the  inclusion  of  i  into  the 
wavefront.  (Recall  that  the  definition  of  the  wavefront  includes  the  diagonal  element.) 
Initially,  incr(i)  is  deg(i)  -|-  size(i)  since  nothing  is  in  the  wavefront  yet. 

The  second  component  of  the  priority  function,  dist(i,  e),  measures  the  distance  of  a 
vertex  i  from  the  end  vertex  e.  This  component  encourages  the  numbering  of  vertices 
that  are  very  far  from  e  even  at  the  expense  of  a  larger  wavefront  at  the  current  step. 
This  component  is  easily  computed  for  all  i  by  a  breadth  first  search  rooted  at  e. 

The  Priority  Function.  Denote  by  P{i)  the  priority  of  an  eligible  vertex  i  during 
a  step  of  the  algorithm.  The  priority  function  used  by  Sloan,  and  Duff,  Reid  and  Scott 
is  a  linear  combination  of  two  components 

P{i)  =  —Wi  *  incr(i)  -|-  W2  *  dist(i,  e), 

where  Wi  and  W2  are  positive  integer  weights.  At  each  step,  the  algorithm  numbers 
next  an  eligible  vertex  i  that  maximizes  this  priority  function. 

The  value  of  incr(i)  ranges  from  0  to  (A  -|-  1)  (where  A  is  the  maximum  degree  of 
the  unweighted  graph  G),  while  dist(i,  e)  ranges  from  0  to  the  diameter  of  the  graph  G. 
We  felt  it  desirable  for  the  two  terms  in  the  priority  function  to  have  the  same  range 
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Figure  5:  Envelope  parameters  of  BARTH5  as  a  function  of  the  ratio  of  the  weights  Wi 
and  W2. 


so  that  we  could  work  with  normalized  weights  Wi  and  W2.  Hence  we  use  the  priority 
function 

P{i)  =  —  Wi  *  [(dist(s,  e)/A)J  *  incr(i)  +  W2  *dist(i,e). 

If  the  pseudo-diameter  is  less  than  the  maximum  degree,  we  set  their  ratio  to  one.  We 
discuss  the  choice  of  the  weights  later  in  this  section. 

The  Algorithm.  We  present  in  Figure  4  our  version  of  the  weighted  Sloan  algo¬ 
rithm.  This  modified  Sloan  algorithm  requires  fewer  accesses  into  the  data  structures 
representing  the  graph  (or  matrix)  than  the  original  Sloan  algorithm.  The  priority  up¬ 
dating  in  the  algorithm  ensures  that  incr(j)  is  correctly  maintained  as  vertices  become 
active  or  preactive.  When  a  vertex  i  is  numbered,  its  neighbors  and  possibly  their 
neighbors  need  to  be  examined.  Vertex  i  must  be  active  or  preactive,  since  it  is  eligible 
to  be  numbered.  We  illustrate  the  updating  of  the  priorities  for  only  the  first  case  in 
the  algorithm,  since  the  others  can  be  obtained  similarly.  Consider  the  case  when  i  is 
preactive  and  j  is  inactive  or  preactive.  The  multi-vertex  i  moves  from  being  preactive 
to  numbered,  and  hence  moves  out  of  the  wavefront,  decreasing  incr(j)  by  size(«),  and 
thereby  increases  P{j)  by  Wi  *  [(dist(s,  e)/A)J  *  size(f).  Further,  since  j  becomes  active 
and  is  now  included  in  the  wavefront,  it  does  not  contribute  in  the  future  to  incr(/),  and 
hence  P{j)  increases  by  Wi  *  [(dist(s,  e)/A)J  *  size(j). 

The  Choice  of  Weights.  Sloan  [39],  and  Duff,  Reid  and  Scott  [11]  recommend 
the  unnormalized  weights  Wi  =  2,  W2  =  1.  We  studied  the  influence  of  the  normalized 
weights  Wi  and  W2  on  the  envelope  parameters,  and  found,  to  our  initial  surprise,  that 
the  problems  we  tested  fell  into  two  classes. 

The  first  class  is  exemplified  by  the  BARTHS  problem,  whose  envelope  parameters  are 
plotted  for  various  ratios  of  the  weights  in  Figure  5.  The  t/-axis  reports  the  value  of  each 
envelope  parameter  scaled  with  respect  to  the  value  obtained  with  the  unnormalized 
weights  Wi  =  1  and  W2  =  2  in  the  Sloan  algorithm.  Thus  this  and  the  next  Figures 
reveal  the  improvements  obtained  by  normalizing  the  weights  in  the  Sloan  algorithm. 
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Figure  6:  Envelope  parameters  of  FINANCE512  as  a  function  of  the  ratio  of  the  weights 
Wi  and  W2. 


(Information  about  the  problems  in  these  Figures  is  included  in  Table  5.) 

The  envelope  parameters  are  plotted  at  successive  points  on  the  x-axis  corresponding 
to  changing  the  weight  Wi  or  W2  by  a  factor  of  two.  The  ratio  of  the  pseudo-diameter  to 
maximum  degree  is  10  for  this  problem,  and  here  large  values  of  Wi  lead  to  the  smallest 
envelope  size  and  wavefront  sizes.  The  normalized  weights  Wi  =  2  and  W2  =  1  suffice 
to  obtain  these  values;  note  the  asymptotic  behavior  of  the  envelope  parameters.  The 
bandwidth  has  a  contrarian  behavior  to  the  rest  of  the  parameters,  and  thus  high  values 
of  W2  lead  to  small  bandwidths  for  these  problems. 

The  second  class  is  exemplified  by  the  FINANCE512  problem,  whose  envelope  param¬ 
eters  are  plotted  for  various  choice  of  weights  in  Figure  6.  Again,  the  value  of  each 
parameter  is  scaled  by  the  value  obtained  by  the  Sloan  algorithm  with  unnormalized 
weights  Wi  -  2,  W2  =  1-  The  ratio  of  the  pseudo-diameter  to  maximum  degree  is  1. 
Here  high  values  of  H  2  lead  to  small  envelope  parameters.  Note  that  the  bandwidth  fol¬ 
lows  the  same  trend  as  the  rest  of  the  envelope  parameters,  unlike  the  first  class.  Other 
problems  from  Table  5  that  belong  to  this  class  are:  FORDl,  F0RD2,  SKIRT,  NASARB, 
BCSSTK30,  and  FINANCE256.  .Ml  other  problems  belong  to  the  first  class. 

A  user  needs  to  experiment  with  the  weights  to  obtain  a  near-optimal  value  of  an 
envelope  parameter  for  a  new  problem,  since  one  does  not  know  a  priori  which  of  the 
two  classes  it  belongs  to.  Fortunately,  small  integer  weights  suffice  to  get  good  results 
in  our  experiments,  and  hence  a  set  of  good  weights  can  be  selected  automatically  by 
computing  the  envelope  parameters  with  a  few  different  weights. 

The  results  tabulated  in  Section  5  show  that  it  is  possible  to  reduce  the  mean  square 
wavefront  by  choosing  one  normalized  set  of  weights  for  each  problem  in  Class  1,  and 
another  for  each  problem  in  Class  2,  rather  than  the  unnormalized  weights  {Wi  =  2, 
W2  =  1)  used  by  Sloan,  and  Duff,  Reid  and  Scott.  The  weights  we  have  used  are  Wi  =  8, 
IF2  =  1  for  Class  1  problems,  and  kFi  =  1,  IF2  =  2  for  problems  in  Class  2.  An  automatic 
procedure  could  compute  the  envelope  parameters  for  a  few  sets  of  weights,  and  then 
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choose  the  ordering  with  the  smaller  values. 

There  are  two  limiting  cases  of  the  Sloan  algorithm. 

When  Wi  =  0,  W2  ^  0,  then  the  distance  from  the  end  vertex  e  determines  the 
ordering,  and  the  Sloan  algorithm  behaves  almost  like  RCM.  However,  this  limiting  case 
differs  from  the  case  when  Wi  is  nonzero  and  W2  is  much  larger  than  Wi .  In  the  latter 
case,  the  first  term  still  plays  a  role  in  reducing  the  envelope  parameters.  For  instance, 
the  values  of  envelope  parameters  obtained  when  the  ratio  W2/W1  is  2^®  are  significantly 
smaller  than  the  values  obtained  when  fFi  =  0  and  W2  7^  0.  Only  neighbors  and  second- 
order  neighbors  of  the  numbered  vertices  are  eligible  to  numbered  at  any  step,  and 
among  these  vertices  the  first  term  serves  to  reduce  the  local  increase  in  the  wavefront 
when  Wi  is  nonzero. 

The  second  limiting  case,  when  IV2  =  0,  Wi  ^  0,  corresponds  to  a  greedy  algorithm 
in  which  vertices  are  always  numbered  to  reduce  the  local  increase  in  wavefront.  This 
greedy  algorithm  does  particularly  poorly  on  Class  2  problems. 

The  two  classes  of  problems  differ  in  the  importance  of  the  first,  “local”,  term  that 
controls  the  incremental  increase  in  the  wavefront  relative  to  the  second,  “global”,  term 
that  emphasizes  the  numbering  of  vertices  far  from  the  end-vertex.  When  the  first  term 
is  more  important  in  determining  the  envelope  parameters,  the  problem  belongs  to  Class 
1,  and  when  the  second  term  is  more  important,  it  belongs  to  Class  2.  We  have  observed 
that  the  first  class  of  problems  represent  simpler  meshes:  e.g.,  discretization  of  the  space 
surrounding  a  body,  such  as  an  airfoil  in  the  case  of  BARTH5.  The  problems  in  the  second 
class  arise  from  finite  element  meshes  of  complex  three-dimensional  geometrical  objects, 
such  as  automobile  frames.  The  FINANCE512  problem  is  a  linear  program  consisting  of 
several  subgraphs  joined  together  by  a  binary  tree  interconnection.  In  these  problems,  it 
is  important  to  explore  several  “directions”  in  the  graph  simultaneously  to  obtain  small 
envelope  parameters. 

The  bandwidth  is  smaller  when  larger  weights  are  given  to  the  second  term,  for  both 
classes  of  problems.  This  is  to  be  expected,  since  to  reduce  the  bandwidth,  we  need  to 
decrease,  over  all  edges,  the  maximum  deviation  between  the  numbers  of  the  endpoints 
of  an  edge. 

3.2  The  Accelerated  Implementation 

In  the  Sloan  algorithm,  the  vertices  eligible  for  numbering  are  kept  in  a  priority  queue. 
Sloan  [39]  implemented  the  priority  queue  both  as  an  unordered  list  in  an  array  and  as  a 
binary  heap,  and  found  that  the  array  implementation  was  faster  for  his  test  problems  (all 
with  less  than  3,000  vertices).  Hence  he  reported  results  from  the  array  implementation 
only.  Duff,  Reid,  and  Scott  [11]  have  followed  Sloan  in  using  the  array  implementation 
for  the  priority  queue  in  the  Harwell  library  routine  MC40  [1]. 

We  provide  a  complexity  analysis  of  the  worst-case  execution  time  of  the  two  im¬ 
plementations  in  the  Appendix,  which  shows  that  the  heap  implementation  runs  in 
0(n  log  n)  time,  while  the  array  implementation  requires  0(n^'^)  time  for  two-dimensional 
problems,  and  time  for  three-dimensional  problems. 

This  difference  in  running  time  requirements  is  experimentally  observed  as  well.  In 
Figure  7  we  compare  the  times  taken  by  the  array  and  heap  implementations  of  the  Sloan 
algorithm  relative  to  our  implementation  of  the  RCM  algorithm.  The  RCM  algorithm 
uses  a  fast  pseudo-diameter  algorithm  described  by  Duff,  Reid,  and  Scott  [11]. 
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Figure  7:  Relative  timing  performance  of  RCM,  ArraySloan,  and  HeapSloan  algorithms. 


For  the  eighteen  matrices  in  Table  5,  the  mean  time  of  the  ArraySloan  was  11.3  times 
that  of  RCM,  while  the  median  time  was  8.2  that  of  RCM.  However,  the  mean  cost  of 
the  HeapSloan  was  only  2.5  times  of  RCM,  with  the  median  cost  only  2.3.  The  greatest 
improvements  are  seen  for  the  problems  with  greater  numbers  of  vertices  or  with  higher 
average  degrees. 

We  have  also  computed  the  times  taken  by  MC40B  to  order  these  problems,  and 
found  them  to  be  comparable  to  the  times  reported  here  for  the  ArraySloan  implemen¬ 
tation,  inspite  of  the  different  programming  languages  used  (Fortran  for  MC40B  and 
and  C  for  ours.) 

We  emphasize  that  this  change  in  the  data  structure  for  the  priority  queue  has 
no  significant  influence  on  the  quality  of  the  envelope  parameters  computed  by  the 
algorithm.  Minor  differences  might  be  seen  due  to  different  tie-breaking  strategies. 


4  The  Hybrid  Algorithm 

The  hybrid  algorithm  consists  of  two  steps:  first  compute  the  spectral  ordering;  then 
use  a  modification  of  the  second  part  of  the  Sloan  algorithm  to  refine  the  ordering 
locally.  We  shall  refer  to  this  modification  of  the  second  part  as  the  modified  Sloan 
algorithm.  This  abuse  of  nomenclature  should  not  cause  any  confusion  in  the  context 
of  the  hybrid  algorithm.  We  describe  how  we  modified  Sloan  to  refine  a  given  input 
ordering  in  Section  4.1.  Implementation  details  are  presented  in  Section  4.2. 


4.1  Modifications  to  the  Sloan  Algorithm 

To  change  the  Sloan  algorithm  from  one  that  computes  an  ordering  from  scratch  to  one 
that  refines  a  given  ordering,  we  need  to  modify  the  selection  of  start  and  end  nodes,  and 
the  priority  function.  We  use  input  ordering  in  this  section  to  describe  the  ordering  of  the 
matrix  immediately  before  the  Sloan  refinement  is  performed.  In  our  implementation. 
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this  input  ordering  is  the  spectral  ordering,  though  the  refining  algorithm  can  work  with 
any  input  ordering. 

The  Sloan  algorithm  requires  a  start  node  to  begin  numbering  from,  and  an  end  node 
to  compute  the  priority  function.  We  choose  the  start  node  s  to  be  the  first  node  and 
the  end  node  e  to  be  the  last  node  in  the  input  ordering.  Hence  the  burden  of  finding  a 
good  set  of  endpoints  is  placed  on  the  spectral  method.  Experience  suggests  that  this  is 
where  it  should  be.  The  spectral  method  seems  to  have  a  more  global  view  of  the  graph 
than  the  local  diameter  heuristic.  This  feature  alone,  with  no  change  in  the  priority 
function,  yields  improved  envelope  parameters  over  the  Sloan  algorithm  for  most  of  our 
test  problems. 

The  priority  function  is 

P{i)  =  —Wi  *  [(n/A)J  *  incr(i)  +  W2  *  dist(z,e)  —  W3  * 

The  first  two  terms  are  similar  to  the  priority  function  of  the  Sloan  algorithm  (Sub¬ 
section  3.1),  except  that  the  normalization  factor  has  n,  the  number  of  vertices  in  the 
numerator,  rather  than  the  pseudo-diameter.  The  latter  is  not  computed  in  this  context, 
and  this  choice  makes  the  first  and  third  term  range  from  1  to  n. 

This  function  is  sensitive  to  the  initial  ordering  through  the  addition  of  a  third 
weight,  W3.  For  W3  >  0,  higher  priority  is  given  to  lower  numbered  vertices  in  the  input 
ordering.  Conversely,  for  W3  <  0,  priority  is  given  to  higher  numbered  vertices.  This 
effectively  performs  the  refinement  on  the  reverse  input  ordering,  provided  s  and  e  are 
also  reversed.  There  is  some  redundancy  between  weighting  the  distance  from  the  end 
in  terms  of  the  number  of  hops  (dist(i,  e))  and  the  distance  from  the  end  in  terms  of  the 
input  ordering  (i). 

Selection  of  the  nodes  s  and  e  and  the  new  priority  function  are  the  only  algorithmic 
modifications  made  to  the  Sloan  algorithm.  The  node  selection,  node  promotion,  and 
priority  updating  scheme  (see  Fig.  4),  are  unchanged. 

The  normalization  factor  in  the  first  term  of  the  priority  function  makes  the  initial 
influence  of  the  first  and  third  terms  roughly  equal  in  magnitude  when  Wi  and  W3 
are  both  equal  to  1.  The  weight  W2  is  usually  set  to  one.  This  makes  it  a  very  weak 
parameter  in  the  whole  algorithm,  but  small  improvements  result  when  its  influence  is 
nonzero.  If  the  component  of  the  Fiedler  vector  with  the  largest  absolute  value  has  the 
negative  sign,  we  set  W3  =  —  1  and  swap  s  and  e.  Otherwise,  we  set  W3  =  1  and  use  the 
nondecreasing  ordering  of  the  Fiedler  vector. 

For  Class  1  problems,  higher  values  of  Wi  can  lead  to  improvements  in  the  envelope 
parameters  over  the  choice  of  Wi  =  1,  even  though  it  is  slight  in  most  cases.  For  Class 
2  problems,  use  of  VFi  =  1,  W2  =  W3  —  2  can  lead  to  improvements  as  well. 

4.2  Implementation  Details 

All  the  results  presented  in  the  following  section  were  obtained  on  a  Sun  SPARCsta- 
tion  20  with  64MB  physical  main  memory  and  846MB  of  swap  space,  running  SunOS  4.1.3 
The  software  used  includes  Matlab  4.2a,  Chaco  2.0  [24]  and  a  suite  of  Matlab  M-files  and 
MEX-files^  that  we  wrote.  All  of  the  MEX-files  are  written  in  C.  A  toolbox  of  M-files 

^Both  M-files  and  MEX-files  are  programs  in  Matlab.  M-files  are  interpreted  and  are  analogous  to 
UNIX  scripts  or  DOS  batch  files.  MEX  files  are  compiled  C  or  Fortran  codes  that  are  dynamically 
linked  into  Matlab. 
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Problem 

1^1 

i«i 

Comment 

BARTH 

6,691 

19,748 

BARTH4 

6,019 

17,473 

2-D  CFD  problems 

BARTH5 

15,606 

45,878 

SHUTTLE.EDDY 

10,429 

46,585 

COPTERl 

17,222 

96,921 

COPTER2 

55,476 

352,238 

FORDl 

18,728 

41,424 

3-D  structural  problems 

F0RD2 

100,196 

222,246 

SKIRT 

45,361 

1,268,228 

NASASRB 

54,870 

1,311,227 

COMMANCHE_DUAL 

7,920 

11,880 

TANDEM_VTX 

18,454 

117,448 

3-D  CFD  problems 

TANDEMJ9UAL 

84,069 

183,212 

ONERA_DUAL 

85,567 

116,817 

BCSSTK30 

28,924 

1,007,284 

3-D  stiffness  matrix 

PDSIO 

16,558 

66,550 

FINANCE256 

37,376 

130,560 

linear  programs 

FINANCE512 

74,752 

261,120 

COMP.SKIRT 

14,944 

160,461 

compressed  SKIRT 

COMP.NASARB 

24,953 

275,796 

compressed  NASARB 

COMP.BCSSTK30 

9,289 

111,442 

compressed  BCSSTK30 

Table  1:  The  list  of  eighteen  test  problems.  For  the  three  problems  that  compressed 
well,  their  compressed  versions  are  also  shown. 


written  by  Gilbert  [19]  was  used  to  generate  some  model  problems,  visualize  results,  and 
test  code  under  development. 

Matlab  is  the  main  platform  on  which  the  experiments  were  done.  Its  interactive 
environment  is  very  flexible  to  use.  M-files  allowed  for  quick  prototype  code  generation. 
However,  M-files  are  interpreted  and  too  slow,  in  general,  for  matrices  of  reasonable  size. 
The  code  was  then  re-written  in  C,  given  a  Matlab  wrapper  function,  and  linked  as  a 
MEX  file  into  Matlab’s  dynamic  library.  Chaco  was  used  to  obtain  the  Fiedler  vector. 


5  Computational  Results 


We  describe  in  Section  5.1  how  we  chose  the  computational  parameters  in  the  hybrid  al¬ 
gorithm.  In  Section  5.2  we  discuss  the  relative  reductions  in  envelope  size  and  wavefront 
of  eighteen  test  problems  obtained  from  ROM,  Sloan,  spectral,  and  hybrid  algorithms. 


5.1  Chaco’s  User  Parameters 

We  use  the  SymmLQ/RQI  option  in  Chaco  to  obtain  the  Fiedler  vector.  Chaco  takes 
a  multilevel  approach,  coarsening  the  grid  until  it  has  less  than  some  user  specified 
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Problem 

mswf 

maxwf 

Egize 

bw 

Time 

(sec.) 

BARTH 

1.26e4 

164 

7.01e5 

199 

0.13 

BARTH4 

1.61e4 

204 

7.03e5 

218 

0.05 

BARTHS 

5.08e4 

351 

3.26e6 

373 

0.16 

SHUTTLE 

5.84e3 

167 

7.09e5 

238 

0.12 

COPTERl 

2.84e5 

797 

8.62e6 

932 

0.13 

C0PTER2 

2.26e6 

2,447 

7.55e7 

2,975 

0.88 

FORDl 

2.65e4 

223 

2.90e6 

258 

0.30 

F0RD2 

3.74e5 

884 

5.72e7 

963 

1.1 

SKIRT 

l.lleG 

1,745 

4.42e7 

2,070 

5.0 

NASARB 

1.65e5 

840 

2.06e7 

881 

3.3 

COMMANCHE.DUAL 

6.73e3 

150 

5.90e5 

155 

0.07 

TANDEM.VERTEX 

8.28e5 

1,489 

1.53e7 

1,847 

0.27 

TANDEM.DUAL 

1.96e6 

2,008 

1.22e8 

2,199 

1.4 

ONERA.DUAL 

4.86e6 

3,096 

1.71e8 

3,478 

1.2 

BCSSTK30 

1.07e6 

1,734 

2.66e7 

2,826 

3.7 

PDSIO 

3.66e6 

2,996 

2.95e7 

4,235 

0.35 

FINANCE256 

9.38e5 

1,437 

3.26e7 

2,014 

0.51 

FINANCE512 

5.79e5 

879 

5.55e7 

1,306 

1.0 

Table  2:  Envelope  parameters  and  CPU  time  on  a  Sun  Sparc-20  workstation  for  the 
RCM  algorithm. 
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Problem 

SLOAN 

NSLOAN 

(Class) 

SPECTRAL 

HYBRID 

BARTH 

0.48 

0.43  (1) 

0.43 

0.30 

BARTH4 

0.40 

0.21  (1) 

0.20 

0.15 

BARTHS 

0.56 

0.18  (1) 

0.18 

0.14 

SHUTTLE 

0.60 

0.60  (1) 

1.0 

0.65 

COPTERl 

0.71 

0.45  (1) 

0.74 

0.53 

C0PTER2 

0.39 

0.27  (1) 

0.28 

0.16 

FORDl 

0.67 

0.67  (2) 

0.48 

0.39 

FORD2 

0.51 

0.51  (2) 

0.44 

0.33 

SKIRT 

0.57 

0.50  (2) 

0.44 

0.37 

NASARB 

0.74 

0.75  (2) 

0.99 

0.71 

COMMANCHE.DUAL 

0.60 

0.34  (1) 

0.37 

0.23 

TANDEM.VTX 

0.16 

0.12  (1) 

0.14 

0.10 

TANDEM.DUAL 

0.53 

0.28  (1) 

0.14 

0.11 

ONERA.DUAL 

0.44 

0.21  (1) 

0.09 

0.07 

BCSSTK30 

0.37 

0.30  (2) 

0.10 

0.05 

PDSIO 

0.20 

0.13  (1) 

0.75 

0.15 

FINANCE256 

0.04 

0.04  (2) 

0.07 

0.04 

FINANCE512 

0.05 

0.06  (2) 

0.14 

0.05 

COMP.SKIRT 

0.46  (2) 

0.51 

0.39 

COMP.NASARB 

0.68  (2) 

1.8 

0.75 

COMP.BCSSTK30 

0.26  (2) 

0.13 

0.06 

Table  3:  Mean  square  Wavefront  sizes  for  various  algorithms  relative  to  ROM.  The 
numbers  in  parentheses  after  the  values  for  the  normalized  Sloan  algorithm  show  the 
class  each  problem  belongs  to  (See  Section  3). 
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Problem 

SLOAN 

NSLOAN 

SPECTRAL 

HYBRID 

BARTH 

0.66 

0.65 

0.64 

0.53 

BARTH4 

0.60 

0.42 

0.37 

0.34 

BARTH5 

0.77 

0.44 

0.42 

0.39 

SHUTTLE 

0.85 

0.66 

1.3 

0.67 

COPTERl 

0.84 

0.58 

0.65 

0.57 

C0PTER2 

0.58 

0.49 

0.43 

0.32 

FORDl 

0.86 

0.86 

0.96 

0.78 

FORD2 

0.74 

0.78 

0.91 

0.76 

SKIRT 

0.65 

0.84 

0.65 

0.57 

NASARB 

0.73 

0.91 

1.2 

0.86 

COMMANCHE.DUAL 

0.83 

0.55 

0.55 

0.44 

TANDEM.VTX 

0.38 

0.30 

0.29 

0.25 

TANDEM.DUAL 

0.72 

0.55 

0.34 

0.30 

ONERA.DUAL 

0.67 

0.45 

0.34 

0.30 

BCSSTK30 

0.63 

0.64 

0.38 

0.22 

PDSIO 

0.48 

0.40 

1.0 

0.28 

FINANCE256 

0.22 

0.22 

0.30 

0.21 

FINANCE512 

0.28 

0.32 

0.85 

0.49 

COMP.SKIRT 

0.67 

0.68 

0.54 

COMP.NASARB 

0.71 

2.3 

0.78 

COMP.BCSSTK30 

0.52 

0.40 

0.23 

Table  4:  Maximum  wavefront  sizes  relative  to  the  RCM  algorithm. 
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Problem 

SLOAN 

NSLOAN 

SPECTRAL 

HYBRID 

BARTH 

0.69 

0.66 

0.66 

0.55 

BARTH4 

0.64 

0.47 

0.46 

0.40 

BARTHS 

0.75 

0.43 

0.44 

0.39 

SHUTTLE 

0.81 

0.82 

1.0 

0.85 

COPTERl 

0.84 

0.68 

0.89 

0.74 

COPTER2 

0.63 

0.53 

0.56 

0.43 

FORDl 

0.81 

0.80 

0.68 

0.61 

FORD2 

0.71 

0.71 

0.65 

0.56 

SKIRT 

0.77 

0.72 

0.70 

0.63 

NASARB 

0.89 

0.88 

0.99 

0.87 

COMMANCHE.DUAL 

0.73 

0.59 

0.61 

0.47 

TANDEM.VTX 

0.42 

0.37 

0.40 

0.34 

TANDEM.DUAL 

0.72 

0.54 

0.39 

0.34 

ONERA.DUAL 

0.66 

0.46 

0.31 

0.27 

BCSSTK30 

0.60 

0.53 

0.33 

0.25 

PDSIO 

0.41 

0.34 

0.82 

0.38 

FINANCE256 

0.20 

0.22 

0.28 

0.20 

FINANCE512 

0.21 

0.25 

0.34 

0.20 

COMP.SKTRT 

0.70 

0.74 

0.65 

COMP.NASARB 

0.86 

1.1 

0.89 

COMP.BCSSTK30 

0.52 

0.38 

0.26 

Table  5:  Envelope  sizes  relative  to  RCM. 
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Problem 

SLOAN 

NSLOAN 

SPECTRAL 

HYBRID 

BARTH 

2.93 

4.53 

1.76 

4.15 

BARTH4 

5.02 

7.04 

2.64 

7.39 

BARTHS 

3.44 

8.91 

1.96 

5.19 

SHUTTLE 

3.50 

3.39 

2.66 

4.05 

COPTERl 

3.80 

7.34 

1.02 

7.82 

COPTER2 

4.05 

11.4 

1.89 

8.39 

FORDl 

7.67 

6.91 

12.0 

12.0 

FORD2 

7.06 

12.1 

5.75 

8.04 

SKIRT 

9.37 

3.66 

2.13 

2.15 

NAS  ARB 

5.82 

5.83 

4.17 

5.57 

COMMANCHE.DUAL 

9.94 

15.9 

2.52 

8.15 

TANDEM.VTX 

2.35 

3.56 

1.39 

2.29 

TANDEM.DUAL 

3.55 

9.07 

2.92 

4.72 

ONERA.DUAL 

8.93 

11.3 

2.08 

3.19 

BCSSTK30 

5.60 

5.11 

1.91 

2.28 

PDSIO 

3.59 

3.77 

1.87 

3.58 

FINANCE256 

4.41 

4.11 

2.49 

2.44 

FINANCE512 

3.26 

2.88 

2.84 

2.38 

COMP.SKIRT 

6.07 

3.19 

3.16 

COMP.NASAHB 

5.81 

6.83 

4.72 

COMP.BCSSTK30 

4.02 

2.05 

2.03 

Table  6:  Bandwidths  relative  to  RCM. 


Problem 

SLOAN 

SPECTRAL 

HYBRID 

BARTH 

1.9 

10. 

11. 

BARTH4 

3.4 

18. 

20. 

BARTHS 

2.7 

19. 

21. 

SHUTTLE 

2.7 

15. 

17. 

COPTERl 

4.7 

25. 

28. 

C0PTER2 

3.0 

18. 

20. 

FORDl 

1.7 

12. 

13. 

FORD2 

2.7 

19. 

21. 

SKIRT 

1.7 

3.7 

4.5 

NASARB 

2.3 

8.5 

9.7 

COMMANCHE.DUAL 

2.1 

19. 

19. 

TANDEM.VTX 

2.7 

14. 

16. 

TANDEM.DUAL 

2.2 

14. 

15. 

ONERA.DUAL 

2.3 

15. 

15. 

BCSSTK30 

1.7 

3.2 

4.0 

PDSIO 

2.1 

36. 

37. 

FINANCE256 

2.4 

16. 

18. 

FINANCE512 

2.3 

17. 

18. 

COMP.SKIRT 

0.33 

0.69 

0.91 

COMP.NASARB 

0.49 

1.8 

2.3 

COMP.BCSSTK30 

0.34 

0.56 

0.74 

Table  7:  CPU  times  relative  to  the  RCM  algorithm. 


Metric 

Units 

RCM 

SLOAN 

NSLOAN 

SPECTRAL 

HYBRID 

mswf 

le5 

10. 

3.7 

2.3 

3.1 

1.4 

maxwf 

le2 

12. 

7.0 

6.2 

6.9 

4.5 

Esize 

le7 

3.7 

2.3 

1.9 

1.7 

1.4 

bw 

le3 

1.5 

7.9 

10. 

3.6 

6.4 

CPUTime 

secs. 

1.1 

2.2 

10. 

11. 

Table  8:  Average  performance  of  the  algorithms.  The  arithmetic  mean  of  each  metric  is 
calculated  from  the  unnormalized  values  of  that  metric  for  the  test  problems. 


22 


number  of  vertices  (1000  seems  to  be  sufficient).  Then  it  computes  the  Fiedler  vector 
on  the  coarse  grid,  orthogonalizing  only  for  eigenvectors  corresponding  to  small  eigen¬ 
values.  Then  the  coarse  grid  is  refined  back  to  the  original  grid  and  the  eigenvector  is 
refined  using  Rayleigh  Quotient  Iteration  (RQI).  This  refinement  is  the  dominant  cost 
of  the  whole  process.  During  the  coarsening,  we  compute  generalized  eigenvectors  of  the 
weighted  Laplacians  of  the  coarse  graphs  from  the  equation  Ax  —  XDx,  where  D  is  the 
diagonal  matrix  of  vertex  weights.  This  feature,  obtained  by  turning  on  the  parameter 
MAKE.VWGTS,  speeds  up  the  eigenvector  computation  substantially. 

Two  other  parameters,  EIGEN.TOLERAMCE  and  COARSE_NLEVEL_RQI,  control  how  ac¬ 
curately  eigenvectors  are  computed  and  how  many  levels  of  graph  refinement  occur 
before  the  approximate  eigenvector  is  refined  using  RQI,  respectively.  We  set  the  value 
of  EIGEN_TOLERANCE  to  10“^,  and  it  was  very  effective  in  reducing  cpu-time.  Even  in  the 
case  where  this  tolerance  induces  misconvergences,  the  spectral  ordering  is  still  good  and 
the  hybrid  ordering  even  better  for  most  problems.  The  COARSE_NLEVEL_RC)I  parameter 
didn’t  have  much  effect,  so  we  used  the  program’s  default  value  of  2. 

5.2  Results 

We  consider  five  ordering  algorithms  RCM,  Sloan  with  unnormalized  weights  Wi  =  2, 
W2  =  1,  Sloan  with  normalized  weights  {Wi  =8,  W2  —  1  for  problems  in  Class  1,  and 
Wi  —  1,  W2  —  2  for  problems  in  Class  2),  spectral,  and  hybrid  (normalized  weights  Wi  — 
W2  =  W3  =  1  for  Class  1  problems,  Wi  =  1,  W2  =  W3  =  2  for  Class  2  problems).  When 
we  refer  to  the  Sloan  algorithm  without  mentioning  the  weights,  we  mean  the  algorithm 
with  normalized  weights.  We  have  compared  the  quality  and  time  requirements  of  these 
algorithms  on  eighteen  problems  (see  Table  5.1).  The  problems  are  chosen  to  represent 
a  variety  of  application  areas:  structural  analysis,  fluid  dynamics,  and  linear  programs 
from  stochastic  optimization  and  multicommodity  flows.  The  complete  set  of  results  for 
RCM  are  shown  in  Table  5.2;  for  other  algorithms,  results  normalized  with  respect  to 
RCM  are  presented  in  Tables  5.3  through  5.7. 

A  comparison  of  the  mean  performance  of  the  various  algorithms  is  included  in  Ta¬ 
ble  5.8.  The  CPU  time  for  only  one  of  the  Sloan  algorithms  is  shown  because  the  two 
algorithms  have  identical  running  times  since  they  differ  only  in  the  choice  of  weights. 
The  values  in  this  table  are  computed  by  taking  arithmetic  means  of  the  (unnormalized) 
values  of  each  metric  over  the  problems  in  the  test  collection.  Values  normalized  with 
respect  to  the  RCM  algorithm  (reported  in  Tables  5.3  through  5.7  should  not  be  used 
to  compute  the  arithmetic  mean,  since  the  arithmetic  mean  of  normalized  data  is  incon¬ 
sistent  in  the  sense  that  the  rankings  of  the  algorithms  could  depend  on  the  algorithm 
chosen  as  the  reference  algorithm.  This  is  because  the  larger  ratios  in  the  normalized 
data  strongly  influence  the  arithmetic  mean.  The  reader  can  compute  the  unnormalized 
data  from  the  results  for  RCM  included  in  Table  5.2  and  the  tables  with  the  normalized 
data. 

Initially  we  discuss  the  results  on  the  uncompressed  graphs,  since  most  of  the  graphs 
in  our  test  collection  did  not  gain  much  from  compression.  We  discuss  later  in  this 
section  the  three  problems  that  exhibited  good  gains  from  compression. 

The  envelope  parameters  and  times  reported  in  the  tables  are  normalized  with  re¬ 
spect  to  the  values  obtained  from  RCM.  For  the  Sloan  algorithm,  two  sets  of  values 
are  reported:  the  first  is  from  the  unnormalized  weights  Wi  —  2,  W2  =  1,  and  the 
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second  from  the  normalized  weights  for  Class  1  and  Class  2  problems.  The  normalized 
Sloan  algorithm  is  labeled  by  the  column  NSLOAN  in  Table  5.3,  and  the  number  in  the 
parenthesis  (i)  indicates  the  class  to  which  a  problem  belongs  to.  The  results  for  the 
compressed  problems  are  indicated  by  the  last  three  rows. 

The  Sloan  algorithm  with  the  normalized  weights  reduces  the  mean-square  wavefront 
on  average  to  23%  of  that  of  RCM;  when  unnormalized  weights  are  used  in  the  Sloan 
algorithm,  the  mean  square  wavefront  is  36%  of  that  of  RCM.  (Henceforth,  a  performance 
figure  should  be  interpreted  to  be  the  average  value  for  the  problems  in  the  test  collection; 
we  shall  not  state  this  explicitly.)  The  hybrid  reduces  mean-square  wavefront  to  14% 
of  that  of  RCM,  and  to  60%  of  that  of  (normalized)  Sloan.  The  hybrid  algorithm 
computes  the  smallest  mean  square  wavefront  for  all  but  three  of  the  eighteen  problems. 
Note  that  even  for  the  problems  where  the  spectral  algorithm  does  poorly  relative  to  the 
Sloan  algorithm,  the  post-processing  enables  the  hybrid  algorithm  to  compute  relatively 
small  wavefronts.  In  general,  the  spectral  and  Sloan  algorithms  tend  to  vie  for  second 
place  with  RCM  finishing  fourth. 

These  algorithms  also  yield  smaller  maximum  wavefront  sizes  than  RCM.  The  nor¬ 
malized  Sloan  algorithm  yields  values  about  52%  of  RCM,  while  the  hybrid  computes 
values  about  38%  of  RCM.  Thus  these  algorithms  lead  to  reduced  storage  requirements 
for  frontal  factorization  methods. 

The  results  for  the  envelope  size  are  similar.  The  hybrid,  on  average,  reduces  the 
envelope  size  to  37%  of  that  of  the  RCM  ordering,  and  to  73%  of  that  of  the  normalized 
Sloan  algorithm. 

The  Sloan,  spectral,  and  the  hybrid  algorithms  all  reduce  the  wavefront  size  and 
envelope  size  at  the  expense  of  increased  bandwidth.  This  is  expected  for  the  Sloan 
algorithm  since  Figures  5  and  6  show  that  the  weights  yielding  small  wavefront  sizes  are 
quite  different  from  the  weights  for  small  bandwidth.  It  is  also  not  surprising  for  the 
spectral  and  the  hybrid  algorithms  since  their  objective  functions,  2-sum  (for  spectral, 
see  [16])  and  wave  front  size  (for  the  hybrid)  differ  from  the  bandwidth. 

On  these  test  problems,  our  efficient  implementation  of  the  Sloan  algorithm  requires 
on  average  only  2.1  times  that  of  the  time  taken  by  the  RCM  algorithm.  The  hybrid 
algorithm  requires  about  5.0  times  the  time  taken  by  the  Sloan  algorithm  on  the  av¬ 
erage.  This  ratio  is  always  greater  than  one,  since  the  hybrid  algorithm  uses  second 
step  of  the  Sloan  algorithm  (numbering  the  vertices)  to  refine  the  spectral  ordering,  and 
the  eigenvector  computation  is  much  more  expensive  than  the  first  step  of  the  Sloan 
algorithm  (the  pseudo-diameter  computation).  We  believe  that  these  time  requirements 
are  small  for  the  applications  that  we  consider:  preconditioned  iterative  methods  and 
frontal  solvers. 

Gains  from  Compressed  Graphs.  As  discussed  in  Subsection  3.1,  the  use  of  the 
supervariable  connectivity  graph  [11]  (called  the  compressed  graph  by  Ashcraft  [2])  can 
lead  to  further  gain  in  the  execution  times  of  the  algorithms.  Only  three  of  the  problems, 
SKIRT,  NASARB,  BCSSTK30,  compressed  well.  This  is  because  many  of  the  multicomponent 
finite  element  problems  in  our  test  set  had  only  one  node  representing  the  multiple 
degrees  of  freedom  at  that  node.  The  compression  feature  is  an  important  part  of  many 
software  packages  for  solving  PDF’s,  since  it  results  in  reduced  running  times  and  storage 
overheads,  and  our  results  also  show  impressive  gains  from  compression. 

Three  problems  in  our  test  set  compressed  well:  SKIRT,  NASARB,  and  BCSSTK30. 
Results  for  these  problems  are  shown  in  the  last  three  rows  of  each  table.  The  numbers 
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of  multivertices  and  edges  in  the  compressed  graphs  are  also  shown.  For  these  three 
problems,  compression  speeds  up  the  Sloan  algorithm  on  average  by  a  factor  of  nearly 
5,  and  the  hybrid  algorithm  by  a  factor  of  4.6. 

Compression  improves  the  quality  of  the  Sloan  algorithm  for  these  three  problems, 
and  does  not  have  much  impact  on  the  hybrid  algorithm.  This  improved  quality  of  the 
compressed  Sloan  algorithm  follows  from  our  choice  of  parameters  in  the  compressed 
algorithm  to  correspond  exactly  to  their  values  in  the  uncompressed  graph.  However, 
on  NASARB,  the  spectral  envelope  parameters  deteriorate  upon  compression.  We  do  not 
know  the  reason  for  this,  but  it  could  be  due  to  the  poorer  quality  of  the  eigenvector 
computed  for  the  weighted  problem.  In  any  case,  the  compressed  hybrid  algorithm 
recoups  most  of  this  deterioration. 

6  Applications 

This  section  discusses  preliminary  evidence  demonstrating  the  applicability  of  the  order¬ 
ings  we  generated.  In  Section  6.1  we  describe  how  a  reduction  in  mean  square  wavefront 
directly  translates  into  a  greater  reduction  in  cpu-time  in  a  frontal  factorization.  We 
also  discuss  the  impact  of  these  orderings  on  incomplete  Cholesky  (IC)  preconditioned 
iterative  solvers  in  Section  6.2. 

6.1  Frontal  Methods 

The  work  in  a  frontal  Cholesky  factorization  algorithm  is 

1  " 

work(^)  =  -X^|wfi(yl)|(|wfi(yl)|  +  3). 

^  i=:l 

Hence  a  reduction  in  the  mean-square  wavefront  leads  to  fewer  flops  during  Cholesky 
factorization.  Duff,  Reid,  and  Scott  [11]  have  reported  that  Sloan  orderings  lead  to  faster 
frontal  factorization  times  than  RCM  orderings.  Barnard,  Pothen  and  Simon  [4]  have 
reported  similar  results  when  spectral  orderings  are  used. 

Two  problems  were  run  by  Dr.  Jennifer  Scott  on  a  single  processor  of  a  Cray-J90 
using  the  Harwell  frontal  factorization  code  MA42.  The  matrix  values  were  generated 
randomly.  (The  orderings  used  were  obtained  earlier  than  the  results  reported  in  Ap¬ 
pendix  A;  however,  these  results  suffice  to  show  the  general  trends.)  The  results  in 
Table  9  show  a  general  correlation  between  mean  square  wavefronts  (proportional  to 
flops)  and  factorization  times.  The  spectral  ordering  enables  the  factorization  to  be 
computed  about  5.2  times  faster  than  the  Sloan  ordering  for  the  BCSSTK30  problem;  this 
ratio  is  1.8  for  the  SKIRT  problem.  The  hybrid  does  not  improve  factorization  times  over 
the  spectral  ordering  for  these  problems. 

6.2  Incomplete  Cholesky  Preconditioning 

In  this  section  we  report  preliminary  experiments  on  the  influence  of  our  orderings 
on  preconditioned  conjugate  gradients  (CG).  We  precondition  CG  with  an  Incomplete 
Cholesky  factorization  (IC(A;))  that  controls  fc,  the  level  of  the  fill  introduced. 
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Sun  SPARC20 

Ordering 

Time 

Cray-J90 

Frontal  Solve 
Time  Flops 

Initial 

0 

1106  8.7e-l-"l0 

RCM 

3.7 

1649  1.4e+ll 

bcsstkSO  Sloan 

6.1 

989  7.5e-fl0 

Spectral 

11.9 

188  l.le-t-10 

Hybrid 

14.6 

205  l.le+10 

Initial 

0 

2427  2.1e-kll 

RCM 

5.0 

2233  1.9e+ll 

skirt  Sloan 

8.4 

1754  1.4e-|-ll 

Spectral 

18.6 

979  7.6e+10 

Hybrid 

22.6 

980  7.3e-fl0 

Table  9:  Results  of  two  problems  on  a  CRAY-J90  using  MA42.  Times  reported  are  in 
seconds. 


Since  the  envelope  is  small,  we  confine  fill  to  a  limited  number  of  positions,  and 
hope  to  capture  more  of  the  character  of  the  problem  with  fewer  levels  of  fill.  However, 
a  tighter  envelope  is  only  one  of  the  factors  that  affect  convergence.  For  instance, 
orderings  must  respect  numerical  anisotropy  for  fast  convergence. 

Our  preliminary  results  have  been  mixed.  In  Table  6.2  we  show  information  pertain¬ 
ing  to  two  problems  that  are  representative  of  our  data.  It  is  worth  noting  how  strongly 
the  norm  of  the  remainder  matrix  for  a  given  ordering  is  a  predictor  of  iteration  counts. 
The  BODY.Y-5  problem  shows  that  the  Sloan  ordering  can  be  very  effective  in  reducing 
the  iteration  count.  This  problem  is  a  2-dimensional  mesh  with  an  aspect  ratio  of  10“®. 
In  the  case  of  poor  aspect  ratios,  a  weighted  Laplacian  should  be  more  appropriate  for 
computing  the  spectral  ordering,  but  we  defer  this  topic  for  future  research.  Duff  and 
Meurant  [8]  indicate  that  ordering  becomes  more  significant  when  the  problem  becomes 
more  difficult  (discontinuous  coefficients,  anisotropy,  etc.). 

Another  problem  from  the  Harwell-Boeing  collection  BCSSTK17  did  not  converge 
quickly  for  levels  of  fill  below  two,  indicating  that  it  is  a  difficult  problem.  The  rate  of 
convergence  at  two  levels  of  fill  shows  that  the  new  ordering  reduces  the  iteration  count 
by  almost  half  that  of  its  closest  competitor.  Since  envelope  reduction  concentrates  fill, 
it  is  possible  that  the  benefits  of  the  hybrid  ordering  are  maximized  when  more  than 
one  level  of  fill  is  allowed. 


7  Conclusions 

We  have  observed  that  problems  fall  into  two  distinct  classes  when  we  examine  how  enve¬ 
lope  parameters  vary  asymptotically  as  a  function  of  the  weights  in  the  Sloan  algorithm. 
Small  wavefronts  are  obtained  for  the  first  class  of  problems  when  the  the  “local”  term 
in  the  priority  function  is  weighted  large  relative  to  the  “global”  term;  for  the  second 
class  of  problems,  the  “global”  term  should  be  weighted  to  be  more  important.  The 
bandwidth  behaves  contrary  to  the  other  envelope  parameters  for  the  first  class,  but  its 
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RCM 

Ordering 

Sloan  Spectral 

Hybrid 

body.y-5 

3,608 

2,598 

9,166 

7,276 

\V\  =  18,589 

nnz(L) 

73,721 

73,721 

73,721 

73,721 

1^1  =  55,132 

iteration  count 

756 

497 

1,203 

1,009 

Level  0 

cpu  time 

1,103 

726 

1,715 

1,405 

flops 

6.8e+08 

4.5e+08 

l.le+09 

9.1e+08 

Level  2 

IlflllF 

1,430 

885 

988 

-  501 

nnz(L) 

128,854 

126,141 

128,121 

126,319 

iteration  count 

457 

231 

356 

265 

cpu  time 

726 

376 

564, 

_ 422 

flops 

5.1e+08 

2.6e“|“08 

4.0e+08 

2.9e-f"08 

bcsstklT 

6.5e+08 

6.5e+08 

7.3e+08 

1.9e+09 

|y|  =  10,974 

nnz(L) 

470,304 

473,017 

486,524 

474,935 

|£:|  =  208,838 

iteration  count 

422 

323 

320 

179 

Level  2 

cpu  time 

1131 

894 

871 

503 

flops 

l.le+09 

9.5e+08 

9.5e+08 

5.2e-|-08 

Table  10:  Convergence  of  preconditioned  CG  on  body.y-5  and  bcsstkl7. 
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behavior  is  similar  to  the  others  for  the  second  class.  This  is  understandable  since  the 
bandwidth  is  a  global  property  of  an  ordering  of  a  graph. 

A  new  normalized  scheme  for  choosing  weights  according  to  the  problem  class  im¬ 
proves  the  quality  of  the  orderings  computed  by  the  Sloan  algorithm.  Our  efficient 
implementation  of  the  Sloan  algorithm  on  the  average  required  only  2.1  times  the  time 
taken  by  ROM,  while  producing  mean  square  wavefronts  about  three  times  smaller  than 
those  obtained  from  ROM.  Since  the  cost  of  the  ROM  algorithm  is  a  few  breadth-first- 
searches  through  the  graph,  these  results  imply  that  the  Sloan  algorithm  is  an  effective 
combinatorial  algorithm  for  computing  envelope  and  wavefront  reducing  orderings. 

Our  modified  Sloan  algorithm  for  compressed  graphs  is  very  fast  on  problems  that 
exhibit  good  compression.  Since  this  algorithm  mimics  the  computations  that  would  be 
performed  on  the  original  unweighted  graph,  the  faster  algorithm  does  not  sacrifice  the 
quality  of  the  orderings. 

We  have  also  described  a  hybrid  algorithm  that  combines  a  spectral  algorithm  with 
a  refinement  step  using  a  modified  Sloan  algorithm.  The  hybrid  algorithm  further  im¬ 
proves  the  good  envelope  and  wavefront  reducing  properties  of  the  spectral  algorithm. 
It  produces  orderings  of  better  quality  (about  40%  of  the  normalized  Sloan)  but  at  a 
cost  greater  by  factor  of  five  than  the  HeapSloan  algorithm.  In  applications  such  as 
frontal  factorization  schemes,  where  the  time  taken  to  compute  an  ordering  is  insignif¬ 
icant  relative  to  the  subsequent  factorization  step,  or  for  nonlinear  problems  where  the 
cost  of  the  ordering  can  be  amortized  over  several  linear  solves,  the  hybrid  algorithm 
is  an  attractive  choice.  However,  in  other  applications  where  the  tradeoff  between  the 
quality  of  the  ordering  versus  the  time  required  for  computing  the  ordering  favors  fast 
ordering  algorithms,  the  HeapSloan  is  attractive. 

In  this  work  we  have  primarily  focused  on  improving  the  quality  and  time  require¬ 
ments  of  the  Sloan  algorithm.  With  similar  attention  to  the  eigencomputation  of  the 
spectral  algorithm  we  believe  that  the  time  requirements  of  the  spectral  algorithm  could 
be  reduced,  and  thereby  the  hybrid  algorithm  could  be  made  more  competitive.  An 
interesting  question  is  whether  one  can  design  algorithms  that  compute  orderings  with 
the  same  quality  as  the  hybrid  but  at  the  cost  of  the  Sloan  algorithm.  Boman  and  Hen¬ 
drickson  [5]  have  recently  described  an  attempt  in  this  direction,  a  multilevel  algorithm 
for  wavefront  reduction. 

Much  more  work  is  needed  to  understand  the  influence  of  these  orderings  on  the 
convergence  behavior  of  preconditioned  iterative  solvers. 

Our  software  implementing  these  algorithms  is  available  with  three  different  inter¬ 
faces:  a  stand-alone  code,  a  code  that  can  be  called  within  Matlab,  and  another  callable 
within  PETSc.  These  codes  are  available  from  us  upon  request  by  electronic  mail. 
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A  Time  Complexity 

In  this  Appendix  we  analyze  the  computational  complexity  of  the  two  Sloan  implemen¬ 
tations.  The  analysis  has  the  interesting  feature  that  the  time  complexity  depends  on 
the  maximum  wavefront  size,  a  quantity  related  to  the  mean  square  wavefront  that  the 
algorithm  is  seeking  to  reduce.  Nevertheless,  it  is  possible  to  get  a  priori  complexity 
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bounds  for  problems  with  good  separators.  The  results  clearly  show  the  overwhelm¬ 
ing  superiority  of  the  heap  implementation;  an  analysis  of  the  complexity  of  the  Sloan 
algorithm  is  not  available  in  earlier  published  work. 

The  major  computational  difference  lies  in  the  implementation  of  the  priority  queue 
(see  Section  3.2).  We  call  these  two  implementations  ArraySloan  and  HeapSloan  accord¬ 
ing  to  the  data  structure  used  to  implement  the  queue. 

For  the  array,  the  operations  deleteO,  insertO,  and  increment  .priority  ()  are 
all  0(1)  operations,  but  the  max.priorityO  operation  (finding  the  vertex  with  the 
maximum  priority)  is  0(m),  where  m  is  the  size  of  the  queue.  All  operations  on  the 
binary  heap  are  (9(logm)  except  max.priorityO,  which  is  0(1). 

To  continue  with  our  analysis,  we  will  refer  to  the  algorithm  as  shown  in  Figure  4.  It 
is  immediately  clear  that  the  function  far.neighborsO  (lines  26 — 29)  is  0(deg(y))  for 
ArraySloan.  We  can  bound  this  by  A  =  maxi<i<„(deg(i)).  Similarly,  far.neighborsO 
for  HeapSloan  is  0(A  *  log  m),  where  m  is  the  maximum  size  of  the  priority  queue. 

The  Sloan  function  (lines  1-25)  has  three  loops:  the  initialization  loop  (lines  1-4), 
the  outer  ordering  loop  (lines  6-25),  and  the  inner  ordering  loop  (lines  9-23).  The 
initialization  loop  is  the  same  for  either  implementation,  and  is  easily  seen  to  require 
0(|F^|)  time. 

Consider  now  the  ArraySloan  implementation.  For  each  step  of  the  outermost  loop 
starting  at  line  6,  it  must  find  and  remove  the  vertex  of  maximum  priority,  requiring 
0(m)  time.  The  inner  loop  is  executed  at  most  A  times.  The  worst  case  for  the  inner 
loop  is  when  the  priority  is  incremented  and  the  far. neighbors  routine  is  called,  and 
this  requires  (9(A)  time.  Thus  the  worst  case  running  time  for  the  ordering  loop  is 
0{\V\  A^)).  For  the  entire  algorithm  it  is  0{\V\*  {m-\-  A^)  -f  \E\). 

For  the  HeapSloan  implementation,  at  each  step  of  the  outermost  loop  starting  at 
line  6,  the  algorithm  must  delete  the  vertex  of  maximum  priority,  and  then  rebuild  the 
heap;  this  takes  (9(logm)  time.  The  inner  loop  is  executed  at  most  A  times.  The 
worst  case  for  the  inner  loop  is  when  the  priority  is  incremented  and  the  f  ar.neighbors 
function  is  called.  This  time  is  0(  A  *  log  m).  The  worst  case  time  complexity  for  the 
ordering  loop  of  HeapSloan  is  thus  0{V\  *  A^  *  logm).  For  the  entire  algorithm  it  is 
(9(|y|  *  A^  *  logm  +  |£^|). 

These  bounds  can  be  simplified  further.  The  maximum  size  of  the  queue  can  be 
bounded  by  the  smaller  of  (1)  the  product  of  the  maximum  wavefront  of  the  reordered 
graph  and  the  maximum  degree,  and  (2)  the  number  of  vertices  n.  Then  the  com¬ 
plexity  of  ArraySloan  is  (9(|y  |  *  A  maxwf),  while  the  complexity  of  HeapSloan  is 
(9(|V|  *  A^  *  log(maxwf  *  A)).  If  we  consider  degree-bounded  graphs,  as  finite  element 
or  finite  difference  meshes  tend  to  be,  then  the  ArraySloan  implementation  has  time 
complexity  0{\V\*  maxwf  +  |£^|),  while  the  time  complexity  of  the  HeapSloan  imple¬ 
mentation  is  0(|I4|  *  log(maxwf)  -f  |-£^|). 

These  bounds  have  the  unsatisfactory  property  that  they  depend  on  the  maximum 
wavefront,  a  quantity  that  the  algorithm  seeks  to  compute  and  to  reduce.  However,  it 
is  possible  to  remove  this  dependence  from  the  bounds  for  important  classes  of  finite 
element  meshes,  as  we  illustrate  now. 

The  class  of  d-dimensional  overlap  graphs  (where  d>2)  whose  degrees  are  bounded 
includes  finite  element  graphs  with  bounded  aspect  ratios  embedded  in  d  dimensions  and 
all  planar  graphs  [34].  Overlap  graphs  have  (9(n^‘^“9/<^)  separators  that  split  the  graph 
into  two  parts  with  the  ratio  of  their  sizes  at  most  (d  -f  l)/(d  -f  2).  Hence  the  maximum 
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wavefront  can  be  bounded  by  for  a  modified  nested  dissection  ordering  that 

orders  one  part  first,  then  the  separator,  and  finally  the  second  part.  The  Sloan  and 
other  envelope-reducing  algorithms  tend  to  do  better  than  this  modified  nested  dissection 
ordering,  so  we  can  assume  that  the  maximum  wavefront  for  the  Sloan  algorithm  is  also 
bounded  by  this  bound. 

With  the  above  assumption,  we  can  conclude  that  the  HeapSloan  implementation 
requires  0{n  log  n)  time  while  the  ArraySloan  implementation  requires  time 

for  a  d-dimensional  overlap  graph.  For  a  planar  mesh  (d  =  2),  the  ArraySloan  implemen¬ 
tation  requires  (9(n^/^)-time,  while  for  a  three  dimensional  mesh  with  bounded  aspect 
ratios  (d  =  3),  its  time  complexity  is  0(n®/^). 
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